1 Probability Theory for Data Scientists
1.1 Set Theory Concepts
Example 1.1 If we flip a coin twice then the sample space can be written as: \[ \Omega =\{HH,HT,TH,TT\} \]
where \(H\) represents heads and \(T\) tails. This sample space is finite. An event (say \(A\)) could be “at least one head appears”, that is
\[A =\{HT,TH,HH\}\subset \Omega\]
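A minimal sketch of this example in Python, representing each outcome as a two-character string:

```python
from itertools import product

# Sample space for two coin flips: all ordered H/T pairs
omega = {"".join(w) for w in product("HT", repeat=2)}

# Event A: at least one head appears
A = {w for w in omega if "H" in w}
print(sorted(A))   # ['HH', 'HT', 'TH']
print(A <= omega)  # True: A is a subset of omega
```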
Example 1.2 If we are analyzing customer purchase behavior for a single online transaction, the sample space could be the set of all possible combinations of items a customer might select from the store’s catalog. This sample space is in principle finite and therefore countable. However, if the catalog is very large, the sample space can be considered uncountably large for practical purposes. More on this later.
An event could be “customer buys at least one item from category X”, or “customer buys product Y”.
Example 1.3 We measure the time (in seconds) it takes for a user to complete a task on a website. The time limit is predefined at 5 minutes. Then the sample space is \(\Omega = \{0, 1, 2, 3,\ldots, 300\}\) which is finite. If, however, we measure the time with arbitrary precision, then the sample space is the interval \((0,300)\) of real numbers. This sample space is uncountable.
An event could be “user completes the task in under 2 minutes”. In the former case, this corresponds to the set \(A=\{0,1,\ldots, 119 \}\). In the latter case it is the real interval \(A=(0, 120)\).
Events can be described in many different ways. We will use set theory and notation to describe events and operations on events. This can help later in the computation of probabilities.
Note the definition of sigma-algebra does not explicitly require that the intersection of two sets in \(\mathcal F\) is also in \(\mathcal F\). However, this property follows from the other properties and De Morgan’s laws. For example, if \(A,B \in \mathcal F\), then
\[ A^c, B^c \in \mathcal F \implies A^c \cup B^c \in \mathcal F \implies (A^c \cup B^c)^c = A \cap B \in \mathcal F \]
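We can check De Morgan’s laws concretely on a small sample space (a toy verification, not a proof):

```python
omega = frozenset(range(6))
A = frozenset({0, 1, 2})
B = frozenset({2, 3})

def complement(s):
    # Complement taken relative to the sample space omega
    return omega - s

# De Morgan: (A ∪ B)^c = A^c ∩ B^c, so closure under complements
# and unions yields closure under intersections as well
print(complement(A | B) == complement(A) & complement(B))          # True
print(complement(complement(A) | complement(B)) == A & B)          # True
```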
1.2 Probability
We will start by defining probability in an intuitive way. Later we will give a more formal mathematical definition.
1.2.1 Types of Probability
There are several ways to think about probability. These include
Classical Probability: Assumes all possible outcomes in a finite sample space are equally likely. That is, for any event \(A\) with \(n(A)\) outcomes in a sample space \(\Omega\) with \(n(\Omega)\) equally likely outcomes, the probability of \(A\) is: \[ P(A) = \frac{n(A)}{n(\Omega)} \]
Example 1.6 Under this framework, the probability of rolling an even number on a die is assigned to be \(P(\text{rolling an even number}) = \frac{3}{6}\). More generally, this is equivalent to saying the die is fair. Another example is assigning the probability of rain tomorrow, locally at 10 AM, to be 1/2 on the grounds that there are only two possible outcomes: rain or no rain.
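Under the classical framework the computation is a simple count. A sketch in Python for the die example, using exact fractions:

```python
from fractions import Fraction

omega = range(1, 7)                    # a fair six-sided die
A = [w for w in omega if w % 2 == 0]   # event: roll an even number

# Classical probability: favorable outcomes over total outcomes
p_A = Fraction(len(A), len(omega))
print(p_A)  # 1/2
```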
Empirical (or Frequentist) Probability: Based on observed frequencies from repeated experiments. As the number \(N\) of experiment repetitions increases, the relative frequency of an event \(A\) approaches the true probability: \[ P(A) \approx \frac{\text{Number of times A occurred}}{N} \]
Example 1.7 Suppose we do not know the probability of heads when flipping a coin. We can flip the coin 1000 times and, if it lands heads 537 times, we would say the empirical probability of heads is \(0.537\). Furthermore, we might say that the true probability of heads is \(\approx 0.537\). The important aspect of this framework is that, in theory, the more times we flip the coin the closer the empirical proportion will be to the true probability. Finally, according to historical data for our location, it has rained on 33.6% of the days over the last 10 years. The empirical probability of rain tomorrow, locally at 10 AM, is then 0.336.
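A quick simulation illustrates how the empirical proportion settles near the true probability. Here we simulate a fair coin (true probability 0.5, our assumption) with a fixed seed so the run is reproducible:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
N = 100_000
# Each flip is heads with probability 0.5 (assumed fair coin)
heads = sum(random.random() < 0.5 for _ in range(N))

p_hat = heads / N
print(round(p_hat, 3))  # close to the true value 0.5
```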
Subjective Probability: Based on personal belief or judgment, often used when objective data is scarce.
Example 1.8 I had a look through the window and it is a bit overcast, so I believe the probability of rain tomorrow, locally at 10 AM, is 0.7. On the other hand, if I am a weather expert with a background in atmospheric physics, I might believe the probability of rain tomorrow, locally at 10 AM, is 0.9.
1.2.2 Formal definition of probability
After we have chosen a sigma algebra \(\mathcal F\) that contains events we are interested in, we can define probabilities for all the events in a more formal way.
These inequalities, especially Bonferroni’s, will be useful later. Boole’s inequality can be proved by induction, and Bonferroni’s inequality follows from Boole’s inequality and the properties of complements. These facts can be verified by the reader.
1.3 Conditional Probability
Two very useful consequences of the above are the law of total probability, which combines the notion of partition with that of conditional probability, and Bayes’ rule, which allows us to reverse conditional probabilities.
The proof of this result is straightforward and left to the reader.
The law of total probability allows us to calculate the probability of an event \(A\) by considering the different ways it can occur through the events in a partition.
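A small numerical sketch, with hypothetical values for a partition \(B_1, B_2, B_3\) of the sample space:

```python
# Law of total probability: P(A) = sum_i P(A | B_i) * P(B_i)
# Hypothetical numbers for a partition B1, B2, B3
p_B = [0.5, 0.3, 0.2]          # P(B_i); must sum to 1
p_A_given_B = [0.1, 0.2, 0.4]  # P(A | B_i)

p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
print(round(p_A, 2))  # 0.19

# Bayes' rule then reverses the conditioning:
# P(B_1 | A) = P(A | B_1) * P(B_1) / P(A)
p_B1_given_A = p_A_given_B[0] * p_B[0] / p_A
print(round(p_B1_given_A, 3))
```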
1.4 Independence
1.4.1 Independence of events
First intuitively, two events, \(A\) and \(B\), are considered independent if the occurrence of one event does not affect the probability of the other event occurring. Formally,
This means that knowing that event \(B\) has occurred gives us no new information about the probability of event \(A\) occurring, and vice versa.
An obvious consequence of Definition 1.6 of conditional probability is the so-called multiplication rule.
The outcome of flipping a coin may be independent of the outcomes of any previous coin flips. If you flip a coin and get heads, the probability of getting heads on the next flip should remain as before. Of course, this is a simplifying assumption that may not hold in practice. In this course we will make this kind of assumption, especially when it involves sequences of events. Not assuming independence for sequences of events would make things more complicated than necessary for what we want to achieve in this course.
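For two fair coin flips, independence can be verified exactly by enumerating the sample space, a small sketch of the multiplication rule:

```python
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))  # two fair coin flips, equally likely

def prob(event):
    # Classical probability: favorable outcomes over total outcomes
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def A(w):
    return w[0] == "H"  # event A: first flip is heads

def B(w):
    return w[1] == "H"  # event B: second flip is heads

# Multiplication rule under independence: P(A ∩ B) = P(A) * P(B)
p_joint = prob(lambda w: A(w) and B(w))
print(p_joint == prob(A) * prob(B))  # True
```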
1.5 Random Variables
So far we have talked about events, which are subsets of the sample space. In many applications, especially in data science, we are interested in quantifying outcomes numerically. This is where random variables come into play.
Note a random variable evaluates from the outcomes in the sample space rather than the events defined in the sigma algebra. However, events can be defined in terms of random variables. For example, the event “the total amount spent in an order is greater than 50 dollars” can be expressed as \(\{X > 50\}\).
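A random variable is simply a function on outcomes, so the event \(\{X > 50\}\) is the set of outcomes mapped above 50. A sketch with hypothetical order data (the items and prices are invented for illustration):

```python
# Hypothetical sample space: baskets of items, with made-up prices
outcomes = [("shirt",), ("shirt", "shoes"), ("hat",)]
price = {"shirt": 25.0, "shoes": 60.0, "hat": 10.0}

def X(w):
    # Random variable: total amount spent in the order w
    return sum(price[item] for item in w)

# The event {X > 50} as a subset of the sample space
event = [w for w in outcomes if X(w) > 50]
print(event)  # [('shirt', 'shoes')]
```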
The above two random variables are discrete random variables as they take on a finite or countable number of values, that is \(X(\Omega)\) is finite or countable. There are also continuous random variables that can take on any value in a continuous range.
The definition of continuous random variables requires a bit more than simply having an uncountably infinite image set \(X(\Omega)\). The definition is a bit technical as it involves the notion of probability density function.
The idea is that there are no “gaps”, which would correspond to real numbers that have a positive probability of occurring. Instead, a continuous random variable never takes an exact prescribed value, that is, \(P(X=x)=0\) for all \(x\), but there is a positive probability that its value will lie in particular intervals, which can be arbitrarily small.
We note that
\[ \begin{aligned} P(a \leq X <b ) &= F_X(b) - F_X(a)\\ \rule{0in}{4ex} &= \int_a^b f_X(x) dx \end{aligned} \]
so that
\[ f_X(x) = \frac{d}{dx} F_X(x)\qquad \forall x \in \mathbb R \]
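As a numerical sanity check, take \(X\) uniform on \((0, 300)\), as in the task-time example, so that \(f_X(x) = 1/300\) and \(F_X(x) = x/300\); then \(P(0 \leq X < 120) = F_X(120) - F_X(0) = 0.4\), and a simple midpoint-rule integral of \(f_X\) agrees:

```python
# Check P(a <= X < b) = F(b) - F(a) = integral of f over [a, b]
# for X uniform on (0, 300): f(x) = 1/300, F(x) = x/300

def F(x):
    return x / 300.0

def f(x):
    return 1.0 / 300.0

a, b = 0.0, 120.0  # the event "under 2 minutes"
n = 10_000
dx = (b - a) / n
# Midpoint-rule approximation of the integral of f over [a, b]
integral = sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

print(round(F(b) - F(a), 4))  # 0.4
print(round(integral, 4))     # 0.4
```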
The properties above hold for both discrete and continuous random variables. For discrete random variables, the CDF is a step function (continuous from the right), while for continuous random variables, the CDF is a continuous function.